AITopics | key frame

Collaborating Authors

key frame

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

GTQuery-based flowOp4cal flow

Neural Information Processing SystemsApr-25-2026, 07:06:25 GMT

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong querybased image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame.

machine learning, natural language, segmentation, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Add feedback

End-to-end Speech Recognition with similar length speech and text

Fan, Peng, Wang, Wenping, Deng, Fei

arXiv.org Artificial IntelligenceOct-14-2025

The mismatch of speech length and text length poses a challenge in automatic speech recognition (ASR). In previous research, various approaches have been employed to align text with speech, including the utilization of Connectionist Temporal Classification (CTC). In earlier work, a key frame mechanism (KFDS) was introduced, utilizing intermediate CTC outputs to guide downsampling and preserve keyframes, but traditional methods (CTC) failed to align speech and text appropriately when downsampling speech to a text-similar length. In this paper, we focus on speech recognition in those cases where the length of speech aligns closely with that of the corresponding text. To address this issue, we introduce two methods for alignment: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) Loss, which is based on edit distance. To enhance the information on keyframes, we incorporate frame fusion by applying weights and summing the keyframe with its context 2 frames. Experimental results on AISHELL-1 and AISHELL-2 dataset subsets show that the proposed methods outperform the previous work and achieve a reduction of at least 86\% in the number of frames.

artificial intelligence, machine learning, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2510.10453

Country: Asia > China (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Re-Identifying Kākā with AI-Automated Video Key Frame Extraction

Maddigan, Paula, Lensen, Andrew, Shaw, Rachael C.

arXiv.org Artificial IntelligenceOct-13-2025

Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification, however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2510.08775

Country:

North America > United States (0.46)
Oceania > New Zealand (0.34)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

Cao, Xinye, Guo, Hongcan, Qian, Jiawen, Nan, Guoshun, Wang, Chao, Pan, Yuqi, Hou, Tianhao, Wang, Xiaojuan, Gao, Yutong

arXiv.org Artificial IntelligenceOct-8-2025

Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.

large language model, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2510.0604

Country:

Europe (1.00)
North America > United States (0.96)
Asia (0.68)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

User-Centric Communication Service Provision for Edge-Assisted Mobile Augmented Reality

Zhou, Conghao, Gao, Jie, Hu, Shisheng, Cheng, Nan, Zhuang, Weihua, Shen, Xuemin

arXiv.org Artificial IntelligenceOct-1-2025

Future 6G networks are envisioned to facilitate edge-assisted mobile augmented reality (MAR) via strengthening the collaboration between MAR devices and edge servers. In order to provide immersive user experiences, MAR devices must timely upload camera frames to an edge server for simultaneous localization and mapping (SLAM)-based device pose tracking. In this paper, to cope with user-specific and non-stationary uplink data traffic, we develop a digital twin (DT)-based approach for user-centric communication service provision for MAR. Specifically, to establish DTs for individual MAR devices, we first construct a data model customized for MAR that captures the intricate impact of the SLAM-based frame uploading mechanism on the user-specific data traffic pattern. We then define two DT operation functions that cooperatively enable adaptive switching between different data-driven models for capturing non-stationary data traffic. Leveraging the user-oriented data management introduced by DTs, we propose an algorithm for network resource management that ensures the timeliness of frame uploading and the robustness against inherent inaccuracies in data traffic modeling for individual MAR devices. Trace-driven simulation results demonstrate that the user-centric communication service provision achieves a 14.2% increase in meeting the camera frame uploading delay requirement in comparison with the slicing-based communication service provision widely used for 5G.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2509.25905

Country:

Europe (0.67)
North America > United States (0.46)
Asia > China (0.46)
North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Telecommunications (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Information Management (1.00)
Information Technology > Communications > Networks (1.00)
(4 more...)

Add feedback

KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models

Li, Sibo, Hao, Qianyue, Shang, Yu, Li, Yong

arXiv.org Artificial IntelligenceSep-26-2025

Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. This stems from the redundancy of the prevailing frame-to-frame generation approach, where the model conducts costly computation on similar frames, as well as neglecting the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformers computation on a few semantic key frames while employing a lightweight convolutional model to fill the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot's motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason and generate these physically meaningful key frames from textual task descriptions. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68 acceleration compared to the frame-to-frame generation baseline, and focusing on the motion-aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains requiring both efficient and effective world models. Code is released at https://anonymous.4open.science/r/Keyworld-E43D. Robotic world models are generative frameworks that predict future environment states based on an initial observation and a conditioning input (Ding et al., 2024; Agarwal et al., 2025).

artificial intelligence, key frame, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2509.21027

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

Yu, Jiawen, Liu, Hairuo, Yu, Qiaojun, Ren, Jieji, Hao, Ce, Ding, Haitong, Huang, Guangyu, Huang, Guofan, Song, Yan, Cai, Panpan, Lu, Cewu, Zhang, Wenqiang

arXiv.org Artificial IntelligenceSep-19-2025

Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce \textbf{ForceVLA-Data}, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong pi_0-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025.

artificial intelligence, arxiv preprint arxiv, forcevla, (14 more...)

arXiv.org Artificial Intelligence

2505.22159

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Robots > Manipulation (0.48)

Add feedback

NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding

Zhao, Running, Jiang, Zhihan, Zhang, Xinchen, Chang, Chirui, Chen, Handi, Deng, Weipeng, Jin, Luyao, Qi, Xiaojuan, Qian, Xun, Ngai, Edith C. H.

arXiv.org Artificial IntelligenceAug-21-2025

Users often take notes for instructional videos to access key knowledge later without revisiting long videos. Automated note generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools fail to preserve the information conveyed in the original videos comprehensively, nor can they satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system, which automatically converts instructional videos to interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance in objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3746059.3747626

2508.14395

Country:

Europe (1.00)
Asia (0.70)
North America > Canada (0.67)
North America > United States > California (0.28)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Educational Technology > Audio & Video (0.96)
Education > Educational Technology > Media (0.86)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
(3 more...)

Add feedback

PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion

Choi, Jaehyun, Hur, Jiwan, Han, Gyojin, Yu, Jaemyung, Kim, Junmo

arXiv.org Artificial IntelligenceMay-29-2025

Video dataset condensation has emerged as a critical technique for addressing the computational challenges associated with large-scale video data processing in deep learning applications. While significant progress has been made in image dataset condensation, the video domain presents unique challenges due to the complex interplay between spatial content and temporal dynamics. This paper introduces PRISM, Progressive Refinement and Insertion for Sparse Motion, for video dataset condensation, a novel approach that fundamentally reconsiders how video data should be condensed. Unlike the previous method that separates static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames to fully accommodate the motion in an action while achieving better performance but less storage, considering the relation of gradients for each frame. Extensive experiments across standard video action recognition benchmarks demonstrate that PRISM outperforms existing disentangled approaches while maintaining compact representations suitable for resource-constrained environments.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2505.22564

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

key frame

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

GTQuery-based flowOp4cal flow

167bcf2af2cd08fcf75b932022db0311-Paper-Conference.pdf

End-to-end Speech Recognition with similar length speech and text

Re-Identifying Kākā with AI-Automated Video Key Frame Extraction

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

User-Centric Communication Service Provision for Edge-Assisted Mobile Augmented Reality

KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding

PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion